Abstract

Model selection is central to statistics, and many learning problems can be formulated as model selection problems. In this paper, we treat the problem of selecting a maximum entropy model, given various feature subsets and their moments, as a model selection problem, and present a minimum description length (MDL) formulation to solve it. To this end, we derive the normalized maximum likelihood (NML) codelength for these models. Furthermore, we prove that the minimax entropy principle is a special case of maximum entropy model selection in which the complexities of all the models are assumed to be equal. We apply our approach to the gene selection problem and present simulation results.



1 INTRODUCTION 

Given a sequence of observations, the aim of model selection is to select, from a set of candidate models, the model that best 'explains' the data. The model chosen should be complex enough to explain the observations, but it should also be simple enough to generalize well to future observations (Grünwald 2004). Balancing these two opposing aims is one of the most important problems in statistics. It is possible to formulate many learning problems as probabilistic model selection problems. Linear regression is one such example, where we make the assumption of Gaussian noise in order to transform the problem of polynomial fitting into a probabilistic model selection problem.

Many model selection techniques exist in the statistics and machine learning literature. The most popular statistical techniques for model selection are the Akaike information criterion (AIC) (Akaike 1974), the Bayesian information criterion (BIC) (Schwarz 1978) and minimum description length (MDL) (Rissanen 1978). The MDL principle was formalized by Rissanen (1978). The basic idea behind the MDL principle is to equate compression with finding regularities in the data. Since learning often involves finding regularities in the data, learning can be equated with compression as well. Hence, in MDL, we try to find the model that yields the maximum compression for the given observations.

In the past two decades, the normalized maximum likelihood (NML) code, which is a version of MDL, has gained popularity among statisticians. This is particularly because of its minimax properties, as stated in (Shtar'kov 1987), which the earlier versions of MDL did not possess. Efficient methods for computing NML codelengths for mixture models have been proposed in (Kontkanen & Myllymäki 2005; Hirai & Yamanishi 2011). In both these papers, the aim of model selection is to decide the optimum number of clusters in a clustering problem.

Maximum entropy models are an important class of statistical models that have been studied extensively in statistics (Good 1963; Grünwald & Dawid 2004) and information theory (Csiszár & Shields 2004). Applications include natural language processing (Berger et al. 1996) and image processing (Zhu et al. 1997). The maximum entropy principle suggests a way of selecting a probability distribution, given information in the form of moments of some functions of the underlying random variables. Here, one has to decide, a priori, on the amount of 'information' (for example, the number of moments) that should be used from the data to fix the model. The minimax entropy principle (Zhu et al. 1997) states that, given several sets of features for the data, one should choose the set that minimizes the maximum entropy. However, it can easily be shown that if there are two feature subsets $\Phi_1$ and $\Phi_2$ with $\Phi_1 \subset \Phi_2$, the minimax entropy principle will always prefer $\Phi_2$ over $\Phi_1$. Hence, though the minimax entropy principle is a good technique for choosing among sets of features with the same cardinality, it cannot decide when the sets of features have varying cardinality.

In this paper, we formulate the problem of selecting a maximum entropy model, given various feature subsets and their moments, as a model selection problem. We derive the NML codelength for maximum entropy models and minimum I-divergence models (Csiszár 1975), and show that our approach is a generalization of the minimax entropy principle. We also compute the NML codelength for discriminative maximum entropy models. Finally, we apply our approach to the gene selection problem for the Leukemia data set.

The rest of the paper is organized as follows. In Section 2, we briefly discuss maximum entropy models and NML. In Section 3, we derive the NML codelength for maximum entropy models. This is followed by the derivation of the NML codelength for maximum entropy discriminative models in Section 4. Finally, in Section 5, we apply the results obtained in the previous sections to gene selection for the Leukemia data set.



2 PRELIMINARIES AND 
BACKGROUND 

Let $\mathcal{X}$ denote the sample space, and let $\Delta(\mathcal{X})$ denote the set of all probability distributions on $\mathcal{X}$. $\mathcal{X}^n$ denotes the sample space of all data sequences of size $n$, and $x^n$ denotes an element of $\mathcal{X}^n$, where each $x = (x_1, \ldots, x_d)$ is a vector in $\mathcal{X}$.



2.1 MINIMUM DESCRIPTION LENGTH 
PRINCIPLE 

The MDL principle states that when one has to choose from a set of competing models to explain the data, one should choose the model that leads to the best compression of the data. In order to compress the data, we need a code for the data. The two-part MDL code works by first constructing a code for the parameters, followed by a code for the data given the parameters. A formal introduction to MDL is given in (Grünwald 2004). For the sake of completeness, we include some of the material discussed in (Grünwald 2004) in this paper.

Let $\mathcal{M}$ be a family of probability distributions on the sample space $\mathcal{X}$. The two-part code encodes the data sample $D = \{x_1, x_2, \ldots, x_n\} \in \mathcal{X}^n$ by first encoding a distribution $p \in \mathcal{M}$ and then the data $D$ with the help of the distribution $p$. The best hypothesis to explain the data $D$ is then the one that minimizes the sum $L(p) + L(D|p)$ (Grünwald 2004). Thus, according to the MDL principle,

$$L(D) = \min_{p \in \mathcal{M}} \left[ L(p) + L(D|p) \right], \tag{1}$$

where $L(D)$ is the codelength of the data, $L(p)$ is the codelength of the distribution and $L(D|p)$ is the codelength of the data given the distribution.

$L(D|p)$ is a measure of the error of the data $D$ with respect to the distribution $p$. Hence, if $p$ approximates $D$ well enough, $L(D|p)$ should be small, and vice versa. By Kraft's inequality, there exists a codelength function $L_p$ on $\mathcal{X}^n$ given by

$$L_p(D) = -\log p(D). \tag{2}$$

Here, we have dropped the ceiling function because we are not concerned with actual codes but with codelengths. We can use $L_p(D)$ as $L(D|p)$ directly, since it is the unique minimizer of the expected codelength when $p$ is indeed the true distribution.
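
The last claim can be checked numerically. The following minimal Python sketch uses a made-up three-symbol alphabet and two made-up distributions (neither is from the paper) to compute the idealized codelength $-\log p(x)$ in nats and to compare the expected codelength obtained with the true distribution against the one obtained with a mismatched distribution.

```python
import math

def codelength(p, x):
    """Idealized codelength -log p(x) in nats (ceiling dropped, as in the text)."""
    return -math.log(p[x])

def expected_codelength(true_p, code_p):
    """Expected codelength when symbols follow true_p but are coded with code_p."""
    return sum(t * codelength(code_p, x) for x, t in true_p.items() if t > 0)

# Hypothetical distributions, chosen only for illustration.
true_p = {"a": 0.5, "b": 0.25, "c": 0.25}
other_p = {"a": 0.2, "b": 0.4, "c": 0.4}

entropy = expected_codelength(true_p, true_p)     # Shannon entropy of true_p
mismatch = expected_codelength(true_p, other_p)   # cross-entropy, never smaller

print(f"coding with the true distribution: {entropy:.4f} nats")
print(f"coding with a mismatched one:      {mismatch:.4f} nats")
```

The gap between the two numbers is the Kullback-Leibler divergence between the two distributions, which is why the minimizer is unique.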

There are several ways of constructing $L(p)$. However, it can be shown that for any choice of $L(p)$, the two-part MDL code is incomplete, i.e., there always exist some redundant codes. We can therefore always construct a more efficient code by removing those redundancies, and so we need a code that satisfies some conditions of optimality. The normalized maximum likelihood code is one such code.

To formalize the definition of normalized maximum likelihood, we need the following notion of 'regret' (Grünwald 2004).



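Definition 1. Let $\mathcal{M}$ be a model on $\mathcal{X}^n$ and let $\bar{p}$ be a probability distribution on $\mathcal{X}^n$. The regret of $\bar{p}$ with respect to $\mathcal{M}$ for the data sample $x^n$ is defined as

$$\mathcal{R}(\bar{p}) = -\log \bar{p}(x^n) - \min_{p \in \mathcal{M}} \left[ -\log p(x^n) \right].$$

The regret is nothing but the extra number of bits needed to encode $x^n$ using $\bar{p}$ instead of the optimal distribution for $x^n$ in $\mathcal{M}$. The worst-case regret, denoted by $\mathcal{R}_{\max}$, is defined as the maximum regret over all sequences in $\mathcal{X}^n$:

$$\mathcal{R}_{\max}(\bar{p}) = \max_{x^n \in \mathcal{X}^n} \left[ -\log \bar{p}(x^n) - \min_{p \in \mathcal{M}} \left( -\log p(x^n) \right) \right].$$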



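Our aim is to find the distribution that minimizes the maximum regret. To this end, we define the complexity of a model, $\mathrm{COMP}(\mathcal{M})$, as

$$\mathrm{COMP}(\mathcal{M}) = \log \int p(x^n \mid \hat{\theta}(x^n)) \, dx^n, \tag{3}$$

where $\hat{\theta}(x^n)$ denotes the maximum likelihood parameter for $x^n$. Furthermore, the error of a model is defined as

$$\mathrm{ERR}(\mathcal{M}, x^n) = \min_{p \in \mathcal{M}} \left[ -\log p(x^n) \right]. \tag{4}$$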



The following result is due to Shtar'kov (1987).

Theorem 2.1. If the complexity of a model is finite, then the minimax regret is uniquely achieved by the normalized maximum likelihood distribution given by

$$p_{\mathrm{nml}}(x^n) = \frac{p(x^n \mid \hat{\theta}(x^n))}{\int p(y^n \mid \hat{\theta}(y^n)) \, dy^n}.$$

The corresponding codelength $-\log p_{\mathrm{nml}}(x^n)$, also known as the stochastic complexity of the data sample $x^n$, is given by

$$\mathrm{NML}(\mathcal{M}, x^n) = \mathrm{ERR}(\mathcal{M}, x^n) + \mathrm{COMP}(\mathcal{M}). \tag{5}$$
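
As an illustration of Theorem 2.1, the sketch below builds the NML distribution for a Bernoulli model over binary sequences of length 4. The Bernoulli model and the sequence length are assumptions made for this example only (the paper's models are maximum entropy models); on a finite sample space the integral in the complexity becomes a sum. The code checks that the regret of the NML distribution is the same for every sequence and equals the complexity, whereas the uniform code has a larger worst-case regret.

```python
import math
from itertools import product

def max_likelihood(seq):
    """p(x^n | theta_hat(x^n)) for the Bernoulli model (an assumed example model)."""
    n, k = len(seq), sum(seq)
    theta = k / n
    return theta ** k * (1 - theta) ** (n - k)   # Python evaluates 0.0 ** 0 as 1.0

def nml_parts(n):
    """Return COMP and the NML distribution over all binary sequences of length n."""
    seqs = list(product((0, 1), repeat=n))
    norm = sum(max_likelihood(s) for s in seqs)
    p_nml = {s: max_likelihood(s) / norm for s in seqs}
    return math.log(norm), p_nml

def regret(p_bar, seq):
    """-log p_bar(x^n) - min_p [-log p(x^n)], the regret of Definition 1."""
    return -math.log(p_bar[seq]) + math.log(max_likelihood(seq))

n = 4
comp, p_nml = nml_parts(n)
uniform = {s: 0.5 ** n for s in p_nml}

nml_regrets = [regret(p_nml, s) for s in p_nml]
uniform_worst = max(regret(uniform, s) for s in uniform)

print(f"COMP = {comp:.4f} nats")
print(f"regret of NML ranges over [{min(nml_regrets):.4f}, {max(nml_regrets):.4f}]")
print(f"worst-case regret of the uniform code = {uniform_worst:.4f} nats")
```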



2.2 MAXIMUM ENTROPY MODELS 



The principle of maximum entropy (Jaynes 1957) states that when one has to choose from a set of competing explanations for a set of observations, one should choose the explanation that has maximum uncertainty. The uncertainty mentioned here is the Shannon entropy given by

$$H(p) = -\int p(x) \log p(x) \, dx. \tag{6}$$

The maximum entropy model for a given set of observations is constructed by first deciding on the information to use from the observations. Generally, a set of functions is chosen and their empirical means are computed. These are then equated to the expected means of the functions, thereby forming a set of constraints. The set of probability distributions that satisfy these constraints is known as a linear family. The maximum entropy model is then obtained by solving the resulting optimization problem.

Let $X$ be a random variable taking values in $\mathcal{X}$, whose probability distribution we wish to estimate. Let $\Phi = \{\phi_1, \ldots, \phi_m\}$ be a set of functions of the random variable whose empirical values have been equated to their expected values in the resulting optimization problem. The resulting linear family $\mathcal{L}_{\Phi, x^n}$ is given by the set of all probability distributions that satisfy the constraints

$$\int p(x) \phi_k(x) \, dx = \bar{\phi}_k(x^n), \qquad 1 \le k \le m, \tag{7}$$

where $\bar{\phi}_k(x^n)$ is the empirical estimate of $\phi_k$ for the data sequence $x^n$.



By the method of Lagrange multipliers, it is easy to see that the solution to this optimization problem has the form

$$p(x) = \exp\left( -\lambda_0 - \sum_{k=1}^{m} \lambda_k \phi_k(x) \right). \tag{8}$$

Here, $\lambda_0$ is a normalizing constant. We use the notation $\Lambda = (\lambda_0, \ldots, \lambda_m)$ to represent the set of Lagrange multipliers in the above equation. The values of $\Lambda$ for the maximum entropy distribution can be computed by maximizing the log-likelihood, or equivalently by minimizing the negative log-likelihood function $l(x^n \mid \Lambda)$ given by

$$l(x^n \mid \Lambda) = -\log p(x^n; \Lambda). \tag{9}$$
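
A minimal sketch of this construction for a single feature $\phi(x) = x$ on a finite sample space is given below. The support, the data and the bisection solver are assumptions made for illustration and are not the paper's procedure. The code fits $\lambda_1$ so that the model mean matches the empirical mean, recovers $\lambda_0$ as the log-normalizer, and compares the entropy of the fitted distribution with $\lambda_0 + \lambda_1 \bar{\phi}(x^n)$ and with the negative log-likelihood of equation (9).

```python
import math

def gibbs(support, lam):
    """Return (p, lam0) for p(x) = exp(-lam0 - lam * x) on a finite support."""
    weights = [math.exp(-lam * x) for x in support]
    total = sum(weights)
    return [w / total for w in weights], math.log(total)

def fit_mean(support, target, lo=-50.0, hi=50.0, iters=200):
    """Bisection on lam so that the model mean equals the empirical mean."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        p, _ = gibbs(support, mid)
        mean = sum(pi * x for pi, x in zip(p, support))
        if mean > target:          # the model mean decreases as lam grows
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical quantized data on the support {0, ..., 4}.
support = [0, 1, 2, 3, 4]
data = [0, 1, 1, 2, 4, 0, 1, 3]
mean_hat = sum(data) / len(data)

lam = fit_mean(support, mean_hat)
p, lam0 = gibbs(support, lam)

model_mean = sum(pi * x for pi, x in zip(p, support))
entropy = -sum(pi * math.log(pi) for pi in p)
dual = lam0 + lam * mean_hat                       # lambda_0 + lambda_1 * empirical mean
neg_log_lik = -sum(math.log(p[x]) for x in data)   # equation (9) at the fitted parameters

print(f"lambda_0 = {lam0:.4f}, lambda_1 = {lam:.4f}")
print(f"entropy = {entropy:.4f}, dual value = {dual:.4f}")
print(f"-log p(x^n) = {neg_log_lik:.4f}, n * entropy = {len(data) * entropy:.4f}")
```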

3 MAXIMUM ENTROPY MODEL 
SELECTION USING MDL 

3.1 PROBLEM DEFINITION 

The problem that we intend to address in this paper is as follows.

Let $\Phi_l = \{\phi_{1,l}, \ldots, \phi_{m_l,l}\}$, $1 \le l \le r$, be collections of functions from $\mathcal{X}$ to $\mathbb{R}$. The maximum entropy model $\mathcal{M}_{\Phi_l}$ is defined as the set of probability distributions $p \in \Delta(\mathcal{X})$ that have the following form:

$$p(x) = \exp\left( -\lambda_{0,l} - \sum_{k=1}^{m_l} \lambda_{k,l} \phi_{k,l}(x) \right), \tag{10}$$

where $\Lambda = (\lambda_{0,l}, \ldots, \lambda_{m_l,l}) \in \mathbb{R}^{m_l + 1}$. Here, $\lambda_{0,l}$ is the normalizing constant.

Given a set of maximum entropy models characterized by their function sets $\Phi_l$, $1 \le l \le r$, we use the NML code to choose the model that best describes the data. Since we are addressing the particular problem of choosing appropriate function sets for maximum entropy modeling of data, we term it maximum entropy model selection.

As a special case, one can choose $\phi_{ij}(x) = x_i^j$, $1 \le i \le d$, $1 \le j \le m_i$, for all $x \in \mathcal{X}$. The aim is then to find the best tuple $(m_1, \ldots, m_d) \in \mathbb{N}^d$ for the data. One can further simplify the problem and assume $m_1 = m_2 = \ldots = m_d = m$. The aim is then to find the best $m$.

3.2 NML CODELENGTH 

The NML codelength of a data sequence $x^n$ for a given model $\mathcal{M}$ is composed of two parts: (i) the error codelength and (ii) the complexity of the model.



Proposition 3.1. The error codelength of the data sequence $x^n$ for the maximum entropy model $\mathcal{M}_\Phi$ is $n$ times the maximum entropy of the corresponding linear family,

$$\mathrm{ERR}(\mathcal{M}_\Phi, x^n) = n H(p^*_{x^n}), \tag{11}$$

where $p^*_{x^n}$ is the maximum entropy distribution of the linear family $\mathcal{L}_{\Phi, x^n}$ given by (7).

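Proof. First, we compute the error codelength of the data sequence $x^n$ for the model $\mathcal{M}_\Phi$. By definition,

$$\mathrm{ERR}(\mathcal{M}_\Phi, x^n) = \min_{p \in \mathcal{M}_\Phi} \left\{ -\log p(x^n) \right\}. \tag{12}$$

Using the definition of $\mathcal{M}_\Phi$ from equation (10), and writing $x^{(i)}$ for the $i$-th element of $x^n$, we get

$$\begin{aligned}
\mathrm{ERR}(\mathcal{M}_\Phi, x^n)
&= \min_\Lambda \left[ -\sum_{i=1}^{n} \left( -\lambda_0 - \sum_{k=1}^{m} \lambda_k \phi_k(x^{(i)}) \right) \right] \\
&= \min_\Lambda \left[ \sum_{i=1}^{n} \left( \lambda_0 + \sum_{k=1}^{m} \lambda_k \phi_k(x^{(i)}) \right) \right] \\
&= \min_\Lambda \left[ n \lambda_0 + \sum_{k=1}^{m} \lambda_k \left( n \bar{\phi}_k(x^n) \right) \right] \\
&= n \min_\Lambda \left( \lambda_0 + \sum_{k=1}^{m} \lambda_k \bar{\phi}_k(x^n) \right),
\end{aligned} \tag{13}$$

where $\bar{\phi}_k(x^n)$ is the sample estimate of $\phi_k$ for the data sequence $x^n$ and $\Lambda = (\lambda_0, \ldots, \lambda_m) \in \mathbb{R}^{m+1}$.

Using Lagrange multipliers, it is easy to see that the maximum entropy distribution for the linear family $\mathcal{L}_{\Phi, x^n}$ has the form

$$p^*(x) = \exp\left( -\lambda_0^* - \sum_{k=1}^{m} \lambda_k^* \phi_k(x) \right) \quad \text{for all } x \in \mathcal{X}, \tag{14}$$

for some $\Lambda^* = (\lambda_0^*, \ldots, \lambda_m^*)$. The parameters $\Lambda^*$ can be obtained by maximizing the log-likelihood function:

$$\Lambda^* = \operatorname*{argmax}_\Lambda \sum_{i=1}^{n} \left( -\lambda_0 - \sum_{k=1}^{m} \lambda_k \phi_k(x^{(i)}) \right) = \operatorname*{argmin}_\Lambda \left( \lambda_0 + \sum_{k=1}^{m} \lambda_k \bar{\phi}_k(x^n) \right), \tag{15}$$

where we remove the negative sign to change argmax to argmin. The notation $\bar{\phi}_k(x^n)$ is used to denote the empirical mean of $\phi_k$ for the data sequence $x^n$.

The corresponding entropy is given by

$$H(p^*_{x^n}) = -\int p^*(x) \log p^*(x) \, dx = \lambda_0^* + \sum_{k=1}^{m} \lambda_k^* \int p^*(x) \phi_k(x) \, dx = \lambda_0^* + \sum_{k=1}^{m} \lambda_k^* \bar{\phi}_k(x^n), \tag{16}$$

where the last equality follows from the definition of $\mathcal{L}_{\Phi, x^n}$ in equation (7). By combining equations (15) and (16), we get

$$H(p^*_{x^n}) = \min_\Lambda \left( \lambda_0 + \sum_{k=1}^{m} \lambda_k \bar{\phi}_k(x^n) \right). \tag{17}$$

By combining equations (17) and (13), we get

$$\mathrm{ERR}(\mathcal{M}_\Phi, x^n) = n H(p^*_{x^n}). \tag{18}$$

□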



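Corollary 3.2. The complexity of the maximum entropy model $\mathcal{M}_\Phi$ is given by

$$\mathrm{COMP}(\mathcal{M}_\Phi) = \log \int \exp\left( -n H(p^*_{y^n}) \right) dy^n, \tag{19}$$

where $p^*_{y^n}$ is the maximum entropy distribution of $\mathcal{L}_{\Phi, y^n}$.

Proof. By definition, we have

$$\mathrm{COMP}(\mathcal{M}_\Phi) = \log \int \max_{p \in \mathcal{M}_\Phi} p(y^n) \, dy^n. \tag{20}$$

Now,

$$\max_{p \in \mathcal{M}_\Phi} p(y^n) = \exp\left( \max_{p \in \mathcal{M}_\Phi} \log p(y^n) \right) = \exp\left( -n H(p^*_{y^n}) \right), \tag{21}$$

where we have used the definition of error in (4) and its relationship with entropy in (11) to get the result. Substituting equation (21) in (20), we get the desired result. □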

Hence, the NML codelength (also known as stochastic complexity) of $x^n$ for the model $\mathcal{M}_\Phi$ is given by

$$\mathrm{NML}(\mathcal{M}_\Phi, x^n) = n H(p^*_{x^n}) + \log \int \exp\left( -n H(p^*_{y^n}) \right) dy^n. \tag{22}$$
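
The following sketch evaluates both terms of equation (22) by brute-force enumeration on a tiny finite sample space, where the integral becomes a sum over all sequences $y^n$. The three-point support, the two data sequences and the choice of comparing an empty feature set with the single feature $\phi(x) = x$ are assumptions made for illustration. When the empirical mean lies on the boundary of the support, the maximized likelihood is taken as its limiting value 1, that is, entropy 0. The error term never increases when the feature is added, while the complexity term grows, so the preferred model depends on the data.

```python
import math
from functools import lru_cache
from itertools import product

SUPPORT = (0, 1, 2)          # hypothetical quantized sample space

@lru_cache(maxsize=None)
def max_entropy(mean, use_mean_feature):
    """Maximum entropy (nats) on SUPPORT, with or without the feature phi(x) = x."""
    if not use_mean_feature:
        return math.log(len(SUPPORT))          # no constraint: uniform distribution
    if mean <= min(SUPPORT) or mean >= max(SUPPORT):
        return 0.0                             # constraint forces a point mass
    lo, hi = -50.0, 50.0
    for _ in range(200):                       # bisection: model mean decreases in lam
        lam = 0.5 * (lo + hi)
        w = [math.exp(-lam * x) for x in SUPPORT]
        if sum(wi * x for wi, x in zip(w, SUPPORT)) / sum(w) > mean:
            lo = lam
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    lam0 = math.log(sum(math.exp(-lam * x) for x in SUPPORT))
    return lam0 + lam * mean                   # equation (17) with a single feature

def nml_terms(data, use_mean_feature):
    """Return (ERR, COMP) of equation (22), enumerating every sequence y^n."""
    n = len(data)
    err = n * max_entropy(sum(data) / n, use_mean_feature)
    comp = math.log(sum(math.exp(-n * max_entropy(sum(y) / n, use_mean_feature))
                        for y in product(SUPPORT, repeat=n)))
    return err, comp

balanced = (0, 1, 2, 0, 1, 2)
skewed = (0, 0, 0, 0, 0, 1)
for name, data in (("balanced", balanced), ("skewed", skewed)):
    for use in (False, True):
        err, comp = nml_terms(data, use)
        label = "with phi(x) = x" if use else "no feature"
        print(f"{name:9s} {label:16s} ERR = {err:.3f}  COMP = {comp:.3f}  NML = {err + comp:.3f}")
```

The enumeration has $|\mathcal{X}|^n$ terms, which is only feasible for very small $n$; this is the computational bottleneck the paper refers to.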



3.3 GENERALIZATION OF MINIMAX
ENTROPY PRINCIPLE

In this section, we show that the presented NML formulation for maximum entropy is a generalization of the minimax entropy principle (Zhu et al. 1997), where this principle has been used for feature selection in texture modeling.

Let $\Phi_1, \ldots, \Phi_l$ be sets of functions from $\mathcal{X}$ to the set of real numbers. Corresponding to each set $\Phi_j$, there exists a maximum entropy model $\mathcal{M}_{\Phi_j}$, and vice versa. For example, if $\Phi_j = \{\phi_{j1}, \ldots, \phi_{j m_j}\}$, the corresponding maximum entropy model is

$$p(x) = \exp\left( -\lambda_0 - \sum_{k=1}^{m_j} \lambda_k \phi_{jk}(x) \right) \quad \forall x \in \mathcal{X}. \tag{23}$$

The MDL principle states that given a set of models for the data, one should choose the model that minimizes the codelength of the data. Here, the codelength that we are interested in is the NML codelength (also known as stochastic complexity). Since there exists a one-to-one relationship between the maximum entropy models and the function sets $\Phi_j$, the model selection problem can be reframed as

$$\Phi^* = \operatorname*{argmin}_{\Phi_j,\, 1 \le j \le l} \left[ n H(p^*_{\Phi_j, x^n}) + \log \int \exp\left( -n H(p^*_{\Phi_j, y^n}) \right) dy^n \right]. \tag{24}$$

If we assume that all our models have the same complexity, then the second term on the R.H.S. can be ignored. Since $n$, the size of the data sequence, is a constant, the model selection problem becomes

$$\Phi^* = \operatorname*{argmin}_{\Phi_j,\, 1 \le j \le l} H(p^*_{\Phi_j, x^n}) = \operatorname*{argmin}_{\Phi_j,\, 1 \le j \le l} \; \max_{p \in \mathcal{L}_{\Phi_j, x^n}} H(p).$$

This is the classical minimax entropy principle given in (Zhu et al. 1997). Hence, the minimax entropy principle is a special case of the MDL principle where the complexities of all the models are assumed to be the same and the models assumed are the maximum entropy models.

3.4 MAXIMUM ENTROPY MODELS
WITH KULLBACK'S PRIOR

The minimum I-divergence principle is a generalization of the maximum entropy principle that considers the cases where a prior estimate of the distribution $p$ is available. A maximum entropy model $\mathcal{M}_{\Phi, q}$, $\Phi = \{\phi_1, \ldots, \phi_m\}$, with Kullback's prior $q$ is the set of all probability distributions $p \in \Delta(\mathcal{X})$ that have the form

$$p(x) = q(x) \exp\left( -\lambda_0 - \sum_{k=1}^{m} \lambda_k \phi_k(x) \right) \tag{25}$$

for all $x \in \mathcal{X}$. We state, without proof, the NML codelengths for such models. The proof is similar to the proof of the NML codelength for maximum entropy models without prior.

Proposition 3.3. The error codelength of the data sequence $x^n$ for the maximum entropy model $\mathcal{M}_{\Phi,q}$ with prior $q$ is given by

$$\mathrm{ERR}(\mathcal{M}_{\Phi,q}, x^n) = -\log q(x^n) - n\,\mathrm{KL}(p^*_{x^n} \,\|\, q),$$

where $p^*_{x^n}$ is the I-projection of $q$ on the linear family $\mathcal{L}_{\Phi,x^n}$. In the special case when the prior is the uniform distribution, we get back equation (11).

Proposition 3.4. The complexity of the model $\mathcal{M}_{\Phi,q}$ is given by

$$\mathrm{COMP}(\mathcal{M}_{\Phi,q}) = \log \int q(y^n) \exp\left( n\,\mathrm{KL}(p^*_{y^n} \,\|\, q) \right) dy^n,$$

where $p^*_{y^n}$ is the I-projection of $q$ on the linear family $\mathcal{L}_{\Phi,y^n}$. In the special case when the prior is the uniform distribution, we get back equation (19).

Hence, the NML codelength of $x^n$ for the model $\mathcal{M}_{\Phi,q}$ is given by

$$\mathrm{NML}(\mathcal{M}_{\Phi,q}, x^n) = -\log q(x^n) - n\,\mathrm{KL}(p^*_{x^n} \,\|\, q) + \log \int q(y^n) \exp\left( n\,\mathrm{KL}(p^*_{y^n} \,\|\, q) \right) dy^n. \tag{26}$$

When $q$ is uniform, this codelength reduces to the one in equation (22).
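
The identity in Proposition 3.3 can be checked numerically on a small example. In the sketch below, the four-point support, the prior $q$, the data and the single feature $\phi(x) = x$ are all hypothetical choices; the I-projection is found by bisection on the Lagrange multiplier, and $q(x^n)$ is taken as the product of $q$ over the elements of the sequence.

```python
import math

SUPPORT = (0, 1, 2, 3)
PRIOR = (0.4, 0.3, 0.2, 0.1)          # hypothetical prior q, for illustration only

def i_projection(target_mean):
    """I-projection of PRIOR onto {p : E_p[x] = target_mean}, of the form (25)."""
    def tilted(lam):
        w = [q * math.exp(-lam * x) for q, x in zip(PRIOR, SUPPORT)]
        z = sum(w)
        return [wi / z for wi in w]
    lo, hi = -50.0, 50.0
    for _ in range(200):               # bisection: the tilted mean decreases in lam
        lam = 0.5 * (lo + hi)
        p = tilted(lam)
        if sum(pi * x for pi, x in zip(p, SUPPORT)) > target_mean:
            lo = lam
        else:
            hi = lam
    return tilted(0.5 * (lo + hi))

data = (0, 1, 1, 3, 2, 1)              # hypothetical observations
n = len(data)
p_star = i_projection(sum(data) / n)

kl = sum(p * math.log(p / q) for p, q in zip(p_star, PRIOR))
neg_log_lik = -sum(math.log(p_star[x]) for x in data)      # -log p*(x^n)
neg_log_prior = -sum(math.log(PRIOR[x]) for x in data)     # -log q(x^n)

print(f"-log p*(x^n)               = {neg_log_lik:.6f}")
print(f"-log q(x^n) - n KL(p*||q)  = {neg_log_prior - n * kl:.6f}")
```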



4 DISCRIMINATIVE MODELS 

Discriminative methods for classification model the conditional probability distribution $p(c|x)$, where $c$ is the class label for data $x$. Maximum entropy based discriminative classification tries to find the probability distribution with the maximum conditional entropy $H(C|X)$ subject to some constraints (Berger et al. 1996), where $C$ is the class variable. Initially, information is extracted from the data in the form of empirical means of functions. These empirical values are then equated to their expected values, thereby forming a set of constraints. The classification model is constructed by finding the maximum entropy distribution subject to this set of constraints. We use MDL to decide the amount of information to extract from the data in the form of functions of features. A straightforward application of this technique is feature selection.

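The maximum entropy discriminative model $\mathcal{M}_\Phi$, where $\Phi = \{\phi_1, \ldots, \phi_m\}$, is the set of all probability distributions of the form

$$p(c|x) = \frac{\exp\left( -\sum_{k=1}^{m} \lambda_k \phi_k(x, c) \right)}{\sum_{c' \in C} \exp\left( -\sum_{k=1}^{m} \lambda_k \phi_k(x, c') \right)}. \tag{27}$$

Let us denote the denominator in the above equation by $Z_\Lambda(x)$. Let $\mathcal{L}_{\Phi, x^n, c^n}$ be the linear family given by

$$\mathcal{L}_{\Phi, x^n, c^n} = \left\{ p(c|x) : \int \sum_{c \in C} p(c|x) \, p(x) \, \phi_k(x, c) \, dx = \frac{1}{n} \sum_{i=1}^{n} \phi_k(x^{(i)}, c^{(i)}), \; 1 \le k \le m \right\}.$$

Since we are not interested in modelling $p(x)$, we use the empirical distribution $\tilde{p}(x)$ to approximate $p(x)$, as done in (Nigam et al. 1999). The empirical distribution $\tilde{p}(x)$ is given by

$$\tilde{p}(x) = \begin{cases} \frac{1}{n} & \text{if } x \in \{x^{(1)}, \ldots, x^{(n)}\}, \\ 0 & \text{otherwise}. \end{cases} \tag{28}$$

Hence, the constraints become

$$\sum_{i=1}^{n} \sum_{c \in C} p(c|x^{(i)}) \, \phi_k(x^{(i)}, c) = \sum_{i=1}^{n} \phi_k(x^{(i)}, c^{(i)}). \tag{29}$$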



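As discussed in (Tabus et al. 2003), the sender-receiver model assumed here is as follows. Both sender and receiver have the data $x^n$. The sender is interested in sending the class labels $c^n$. If he sends the class labels without compression, he needs to send $n \log(|C|)$ bits. If, however, he uses the data to compute a probability distribution over the class labels, and then compresses $c^n$ using that distribution, he may get a shorter codelength for $c^n$. His goal is to minimize this codelength, such that the receiver can recover the class labels from this code.

Proposition 4.1. The error codelength of $c^n | x^n$ for the conditional model $\mathcal{M}_\Phi$ is equal to $n$ times the maximum conditional entropy of the model $\mathcal{L}_{\Phi, x^n, c^n}$.

Proof. The error of the conditional model is given by

$$\mathrm{ERR}(\mathcal{M}_\Phi, c^n | x^n) = \min_{p \in \mathcal{M}_\Phi} -\log p(c^n | x^n) = \min_\Lambda \left[ \sum_{k=1}^{m} \lambda_k \sum_{i=1}^{n} \phi_k(x^{(i)}, c^{(i)}) + \sum_{i=1}^{n} \log Z_\Lambda(x^{(i)}) \right], \tag{30}$$

where we have used similar reasoning as in (13) to get the last statement. Also, the maximum conditional entropy distribution can be obtained by maximizing the corresponding log-likelihood function. Hence,

$$p^* = \operatorname*{argmax}_{p \in \mathcal{M}_\Phi} \log p(c^n | x^n).$$

Correspondingly,

$$\Lambda^* = \operatorname*{argmin}_\Lambda \left[ \sum_{k=1}^{m} \lambda_k \sum_{i=1}^{n} \phi_k(x^{(i)}, c^{(i)}) + \sum_{i=1}^{n} \log Z_\Lambda(x^{(i)}) \right]. \tag{31}$$

The corresponding conditional entropy is given by

$$\begin{aligned}
H(p^*) &= -\int \sum_{c \in C} p^*(c|x) \, \tilde{p}(x) \log p^*(c|x) \, dx \\
&= -\frac{1}{n} \sum_{i=1}^{n} \sum_{c \in C} p^*(c|x^{(i)}) \log p^*(c|x^{(i)}) \\
&= \frac{1}{n} \left[ \sum_{i=1}^{n} \sum_{c \in C} p^*(c|x^{(i)}) \sum_{k=1}^{m} \lambda_k^* \phi_k(x^{(i)}, c) + \sum_{i=1}^{n} \sum_{c \in C} p^*(c|x^{(i)}) \log Z_{\Lambda^*}(x^{(i)}) \right] \qquad (32) \\
&= \frac{1}{n} \left[ \sum_{k=1}^{m} \lambda_k^* \left( \sum_{i=1}^{n} \phi_k(x^{(i)}, c^{(i)}) \right) + \sum_{i=1}^{n} \log Z_{\Lambda^*}(x^{(i)}) \right]. \qquad (33)
\end{aligned}$$

Here the first equality follows from the definition of conditional entropy as used in (Berger et al. 1996). We use the definition of $\tilde{p}$ in equation (28) to convert the integral to a summation. By using (29) and the fact that $p^*(c|x^{(i)})$ must sum up to 1 over $c$, we obtain the fourth equality.

Using (31) and (33), we obtain

$$H(p^*) = \frac{1}{n} \min_\Lambda \left[ \sum_{k=1}^{m} \lambda_k \sum_{i=1}^{n} \phi_k(x^{(i)}, c^{(i)}) + \sum_{i=1}^{n} \log Z_\Lambda(x^{(i)}) \right]. \tag{35}$$

Substituting the above equation in (30), we get the desired result.

□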
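
Proposition 4.1 and the constraints (29) can be illustrated on a toy problem. In the sketch below the data set, the two features and the plain gradient-descent fit are all assumptions made for illustration; they are not the paper's experimental setup. The code fits the conditional model (27), then checks that the constraints (29) hold at the optimum and that the minimized codelength $-\log p(c^n|x^n)$ equals $n$ times the conditional entropy.

```python
import math

CLASSES = (0, 1)
# Hypothetical toy data: (quantized feature value x, class label c) pairs.
DATA = [(0, 0), (0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (2, 1), (2, 0)]

def phi(x, c):
    """Feature vector (phi_1, phi_2): a bias and the value x, both active for class 1."""
    return (1.0, float(x)) if c == 1 else (0.0, 0.0)

def cond_prob(lam, x):
    """p(c | x) as in (27): exp(-sum_k lam_k phi_k(x, c)) / Z(x)."""
    scores = [math.exp(-sum(l * f for l, f in zip(lam, phi(x, c)))) for c in CLASSES]
    z = sum(scores)
    return [s / z for s in scores]

def neg_log_lik(lam):
    """-log p(c^n | x^n), the quantity minimized in (30)."""
    return -sum(math.log(cond_prob(lam, x)[c]) for x, c in DATA)

def constraint_gap(lam):
    """Right-hand side minus left-hand side of (29), one entry per feature."""
    gap = [0.0] * len(lam)
    for x, c in DATA:
        p = cond_prob(lam, x)
        for k in range(len(lam)):
            gap[k] += phi(x, c)[k] - sum(p[c2] * phi(x, c2)[k] for c2 in CLASSES)
    return gap

lam = [0.0, 0.0]
for _ in range(5000):            # gradient descent; the gap is the gradient of neg_log_lik
    lam = [l - 0.1 * g for l, g in zip(lam, constraint_gap(lam))]

n = len(DATA)
cond_entropy = -sum(p * math.log(p) for x, _ in DATA for p in cond_prob(lam, x)) / n

print(f"ERR = -log p*(c^n | x^n) = {neg_log_lik(lam):.6f} nats")
print(f"n * H(p*)                = {n * cond_entropy:.6f} nats")
print(f"uncompressed labels      = {n * math.log(len(CLASSES)):.6f} nats")
```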

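Corollary 4.2. The complexity of the conditional model is given by

$$\mathrm{COMP}(\mathcal{M}_\Phi) = \log \sum_{c^n \in C^n} \exp\left( -n H(p^*_{c^n}) \right), \tag{36}$$

where $p^*_{c^n}$ is the maximum conditional entropy distribution of $\mathcal{L}_{\Phi, x^n, c^n}$.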

5 APPLICATION TO GENE 
SELECTION 

We use gene selection as an example to illustrate discriminative model selection for maximum entropy models. The data set used is the Leukemia data set, available publicly at http://www.genome.wi.mit.edu. The data set was also used in (Tabus et al. 2003) to illustrate NML model selection for discrete regression. It consists of two classes: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). There are 38 training samples and 34 independent test samples in the data. The data consists of 7129 genes. The genes are preprocessed as recommended in (Dudoit et al. 2002).



Assuming the sender-receiver model discussed above, the sender needs 38 bits, or 26.34 nats, in order to send the class labels of the training data to the receiver. If the NML code is used, the sender needs 24.99 nats. Since the sender and receiver both hold the microarray data, the sender can use the microarray data to compress the class labels much more than is possible without it. Specifically, we are interested in finding the genes that give the best compression, that is, the minimum NML codelength.
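
The two baseline figures can be reproduced with a few lines of Python. The conversion from bits to nats is exact arithmetic. For the NML figure, the sketch assumes a Bernoulli model of the class labels and a split of 27 ALL and 11 AML samples among the 38 training samples; that split is the one usually reported for this data set and is not stated in the text above, so it should be read as an assumption.

```python
import math

def bernoulli_nml(n, k):
    """NML codelength in nats of n binary labels, k of which belong to the first class."""
    def max_log_lik(j):
        # log of the maximized Bernoulli likelihood; 0 * log(0) is taken as 0
        return sum(c * math.log(c / n) for c in (j, n - j) if c > 0)
    err = -max_log_lik(k)
    comp = math.log(sum(math.comb(n, j) * math.exp(max_log_lik(j)) for j in range(n + 1)))
    return err + comp

raw_nats = 38 * math.log(2)          # 38 bits expressed in nats
nml_nats = bernoulli_nml(38, 27)     # assumed split: 27 ALL and 11 AML

print(f"uncompressed labels: {raw_nats:.2f} nats")
print(f"Bernoulli NML code:  {nml_nats:.2f} nats")
```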

We quantize each gene to 5 levels. Apart from the advantages of quantization mentioned in (Tabus et al. 2003), quantization is necessary for the current problem because the problem of calculating complexity can become intractable even for moderate $n$. The constraints that we use are moment constraints, that is, $\phi_k(x) = x^k$, $1 \le k \le m$. We vary the value of $m$ from 1 to 7 to get a sequence of maximum entropy models. The NML codelength of the class labels is calculated for each such model. The model that results in the minimum NML codelength is selected for each gene.

It was observed that for most genes, the NML codelength decreased sharply when $m$ was increased from 1 to 2. The change in the values of the NML codelength was less noticeable for $m > 2$. The variation of the NML codelength of class labels for a typical gene is shown in Figure 1. In order to make the changes in NML codelength more visible, we skip the NML codelength for $m = 1$.

Table 1: The top 20 genes and the NML codelength achieved by them. The value of m is chosen so as to minimize NML codelength. The genes at positions 1, 2, 3, 4, 5, 8, 12, 14, 16, 17, 18, 20 were also selected in (Golub et al. 1999).

Rank  Codelength  m  Gene index  Gene identifier
1     8.35        2  1882        M27891
2     8.45        2  4847        X95735
3     8.53        2  5039        Y12670
4     8.89        2  3320        U50136
5     9.12        4  1745        M16038
6     9.43        3  248         D14874
7     9.55        2  760         D88422
8     10.31       2  1834        M23197
9     10.51       2  6218        M27783
10    11.11       2  2015        M54995
11    12.12       2  1926        M31166
12    12.23       2  6855        M31523
13    12.79       2  4373        X62320
14    12.81       2  2020        M55150
15    12.93       2  6797        J03801
16    13.00       2  3258        U46751
17    13.14       7  2288        M84526
18    13.17       2  5772        U22376
19    13.19       2  4499        X70297
20    13.22       2  2121        M63138

Our approach for ranking genes is as follows. For each gene, we select the value of $m$ that gives the minimum NML codelength. We then sort the genes in increasing order of their minimum NML codelengths. Table 1 shows the NML codelengths of class labels for the top 20 genes. It can be seen that the minimum codelength achieved is 8.35 nats, which is much smaller than the 24.99 nats achieved without using the microarray data. Since compression is equated with finding regularity according to the minimum description length principle, it can be stated that the topmost gene is able to discover a lot of regularity in the data.

In the second part of our experimentation, we take all the 72 instances (training as well as test). Furthermore, we use 3 classes, T-cell ALL, B-cell ALL, and AML, for our gene selection problem. Each gene has been quantized to 3 levels. It was found that the test data was relatively incompressible as compared to the training data. This means that there was little regularity in the test data as compared to the training data. The top 10 genes selected by MDL for this problem are listed in Table 2.

Table 2: The top 10 genes selected for the 3 class problem and the NML codelengths of class labels achieved for these genes

Table 3: The top 10 gene pairs selected by minimax entropy principle for the 3 class problem and their classification accuracies

To compare the MDL principle with the minimax entropy principle, we find the optimum gene pairs using the minimax entropy principle and the MDL principle discussed above. We then compare the cross-validation error, using 4:1 partitioning, of MDL against that of the minimax entropy principle. The classifier used for computing classification accuracy is an SVM.

The top 10 gene pairs selected by the minimax entropy principle are given in Table 3. The top 10 gene pairs selected by MDL are given in Table 4. The classification accuracy for the two approaches is plotted in Figure 3.



Table 4: The top 10 gene pairs selected by MDL for the 3 class problem and their classification accuracies

Figure 1: Variation of NML codelength of class labels for gene M16038 with the number of moment constraints m



6 CONCLUSION 



In this paper, we proposed an MDL-based approach to the maximum entropy model selection problem. We showed that this approach generalizes the minimax entropy principle. We derived the NML codelength for these models and extended it to discriminative maximum entropy model selection. We tested our proposed method on the gene selection problem and compared the simulation results with the minimax entropy principle based approach. The bottleneck in using MDL for model selection in discriminative classification is the computation of complexity. More efficient approximations for calculating the complexity need to be developed to employ this approach for problems involving larger data sets.

Figure 2: NML codelength of class labels for the top 250 genes

Figure 3: Classification accuracy for the top 20 gene pairs selected by MDL principle compared to the classification accuracy of the top 20 gene pairs selected by minimax entropy principle